
17.4. Evaluation Metrics

17.4.1. Overview and Basics

Evaluation metrics are needed to quantitatively measure the performance of a model. Such performance indicators can be much more complicated than the accuracy or success rate that we use in many daily-life applications. In fact, performance is usually a multi-faceted and relative concept: the measure of model performance can differ considerably depending on which facet of performance matters most to the model developer. In addition, the nature of the problem, i.e., classification (binary, multi-class, or multi-label), regression, or clustering, determines which evaluation metrics are available and how they are formulated. Despite these differences, there is a collection of widely useful metrics. In the following, popular metrics for classification, regression, and clustering are presented.

17.4.2. Classification: Binary

For classification, some algorithms, such as SVM and KNN, directly generate a class or label output. In a numeric setting, such output is usually either 0 or 1 in a binary classification problem. By contrast, many other algorithms, such as logistic regression, random forests, and gradient boosting, output the probabilities of belonging to specific classes/labels. These probability outputs can easily be converted into classes/labels using thresholds, as sketched below. Note that most of the following evaluation metrics are discussed in terms of class/label output.
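As a minimal sketch of this conversion (the probability values below are hypothetical), a fixed threshold such as 0.5 turns predicted probabilities into class labels:

import numpy as np

# Hypothetical predicted probabilities of belonging to the positive class
probs = np.array([0.92, 0.41, 0.55, 0.08, 0.73])

threshold = 0.5                             # decision threshold (problem-dependent)
labels = (probs >= threshold).astype(int)   # 1 = positive, 0 = negative
print(labels)                               # -> [1 0 1 0 1]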

Confusion Matrix

The confusion matrix or error matrix is what we can start with to learn the common metrics for binary classification. A basic confusion matrix can be outlined as a table that contains the numbers of different types of samples classified according to their true labels and predicted labels (Fig. 17.2). More comprehensive confusion matrices can contain parameters constructed with these numbers. Confusion matrices have been widely used in statistics, data mining, and machine learning, as well as other AI applications.
The core of this matrix/table consists of the counts of the four cases corresponding to the different combinations of predicted and actual labels. Listed below is an illustration of the four cases using a widely cited example of pregnancy diagnosis.
  • True Positive (TP): The person who was diagnosed to be pregnant is pregnant.
  • True Negative (TN): The person who was diagnosed not to be pregnant is not pregnant.
  • False Positive (FP): The person who was diagnosed to be pregnant is not pregnant (also known as a "Type I error").
  • False Negative (FN): The person who was diagnosed not to be pregnant is pregnant (also known as a "Type II error").
The above four cases, including the two types of errors, are illustrated in Fig. 17.3. The relative significance of the two types of errors in this example is not obvious. However, if we replace this example with the diagnosis of a life-threatening disease like cancer, then the Type II error, "the person who was diagnosed not to have cancer has cancer," causes a much more serious outcome than the Type I error, "the person who was diagnosed to have cancer does not have cancer," because the former threatens human lives. The situation may be the opposite in spam (email) detection: there, the Type II error, "the email that was diagnosed not to be spam is spam," causes a less serious outcome than the Type I error, "the email that was diagnosed to be spam is not spam," because it is less acceptable for important emails to be marked as spam and missed. Therefore, the meanings and significance of the different cases can vary considerably across applications.

| | Actual Positive (AP) | Actual Negative (AN) | Prevalence $=\frac{AP}{\text{Total}}$ | Accuracy $=\frac{TP+TN}{\text{Total}}$ |
| :---: | :---: | :---: | :---: | :---: |
| Predicted Positive (PP) | True Positive (TP) | False Positive (FP) | Precision, Positive Predictive Value (PPV) $=\frac{TP}{PP}$ | False Discovery Rate (FDR) $=\frac{FP}{PP}$ |
| Predicted Negative (PN) | False Negative (FN) | True Negative (TN) | False Omission Rate (FOR) $=\frac{FN}{PN}$ | Negative Predictive Value (NPV) $=\frac{TN}{PN}$ |
| | True Positive Rate (TPR), Recall, Sensitivity, Probability of Detection, Power $=\frac{TP}{AP}$ | False Positive Rate (FPR), Fall-out, Probability of False Alarm $=\frac{FP}{AN}$ | Positive Likelihood Ratio (LR+) $=\frac{TPR}{FPR}$ | Diagnostic Odds Ratio (DOR) $=\frac{LR+}{LR-}=\frac{TP \cdot TN}{FP \cdot FN}$ |
| | False Negative Rate (FNR), Miss Rate $=\frac{FN}{AP}$ | True Negative Rate (TNR), Specificity (SPC), Selectivity $=\frac{TN}{AN}$ | Negative Likelihood Ratio (LR$-$) $=\frac{FNR}{TNR}=\frac{FN \cdot AN}{AP \cdot TN}$ | F1 Score $=2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision}+\text{Recall}}=\frac{2\,TP}{AP+PP}$ |

Figure 17.2: Confusion matrix
Figure 17.3: Two types of errors
As illustrated in Fig. 17.2, more metrics have been proposed to describe the performance of AI models from different angles. Among them, the most widely adopted ones are accuracy, sensitivity, specificity, and precision.
As the most intuitive performance metric, accuracy tells the percentage of correct predictions.
\begin{equation*}
\text{Accuracy} = \frac{TP + TN}{\text{Total}} \tag{17.119}
\end{equation*}
Sensitivity, which is also called recall, hit rate, or true positive rate, calculates the percentage of the positive samples that have been correctly detected ("recalled"). This metric measures how well the model recognizes a positive class.
\begin{equation*}
TPR = \frac{TP}{AP} = \frac{TP}{TP + FN} = 1 - FNR \tag{17.120}
\end{equation*}
Specificity, selectivity, or true negative rate shows the percentage of actual negative samples that have been correctly detected. Specificity is defined as follows:
\begin{equation*}
TNR = \frac{TN}{AN} = \frac{TN}{TN + FP} = 1 - FPR \tag{17.121}
\end{equation*}
Precision shows the proportion of the predicted positive cases that are truly positive. Precision, or positive predictive value, is formulated using the following equation:
\begin{equation*}
PPV = \frac{TP}{TP + FP} = 1 - FDR \tag{17.122}
\end{equation*}
These different parameters and their combinations convey specific information about the model's performance. For example, high recall with low precision implies that there are few false negatives but many false positives: the model identifies most of the positive samples but is also inclined to label samples as positive. Conversely, low recall with high precision indicates that we miss many positive samples (many false negatives), but the samples that are predicted as positive are mostly truly positive.
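The following is a minimal sketch (with hypothetical label arrays) of how these four basic metrics follow directly from the confusion-matrix counts in Scikit-learn:

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted binary labels
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# Scikit-learn orders the binary confusion matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # Eq. (17.119)
sensitivity = tp / (tp + fn)                 # TPR, Eq. (17.120)
specificity = tn / (tn + fp)                 # TNR, Eq. (17.121)
precision = tp / (tp + fp)                   # PPV, Eq. (17.122)
print(accuracy, sensitivity, specificity, precision)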
Besides the above quantitative descriptions using the basic metrics, composite metrics can also be constructed to consider more facets of the model's performance. One of the most popular composite metrics is the F1 Score, which can assess precision and recall simultaneously. As the harmonic mean for precision and recall values, the F1 score is formulated as follows:
\begin{equation*}
F_{1}=\left(\frac{\text{recall}^{-1}+\text{precision}^{-1}}{2}\right)^{-1}=2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision}+\text{recall}} \tag{17.123}
\end{equation*}
The harmonic mean, rather than the geometric or arithmetic mean, is selected because it is dominated by the smaller of the two values, so the F1 score is high only when both precision and recall are high. A higher F1 score implies a higher predictive power of the classification model: a value close to 1 indicates a nearly perfect model, while a score close to 0 indicates minimal predictive capability. More comprehensive and powerful tools like ROC and AUC will be described in the following subsection.
The following Python code illustrates how to easily obtain a simple confusion matrix and an F1 score:
from sklearn import datasets  # Dataset
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, f1_score  # Metrics

# Loading the dataset: generate 500 samples
X, y_true = datasets.make_moons(n_samples=500, noise=0.3, random_state=10)

# Fitting using the K-Nearest Neighbors classifier
knnc = KNeighborsClassifier(n_neighbors=2).fit(X, y_true)
y_pred = knnc.predict(X)

# Print the evaluation results using the metrics
print(confusion_matrix(y_true, y_pred))
print(f1_score(y_true, y_pred))
In newer versions of Scikit-learn, the confusion matrix can also be conveniently plotted from the true and predicted labels as follows.

from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(y_true, y_pred)

ROC and AUC

An ROC curve, also called a receiver operating characteristic curve, uses a graph to assess the performance of a classification model. The ROC curve plots the relationship between the TPR (sensitivity) and the FPR. The ROC curve of a model can be generated by gradually changing the threshold used for classification; accordingly, each point on the curve corresponds to a specific decision threshold together with the TPR and FPR it produces. Therefore, this metric applies to models with such thresholds, e.g., those using a logistic function to compute the final class. For example, if we set a low classification threshold, then we classify more items as positive, which increases both the number of false positives and the number of true positives. Fig. 17.4 illustrates a typical ROC curve.
Figure 17.4: ROC and AUC
As shown in the figure, the best model, or a perfect one, delivers the dark green ROC curve. The performance deteriorates as the ROC curve moves away from the green curve. The deterioration continues until the red curve, which represents the performance of a random model.
The area under the ROC curve, which is termed AUC-ROC (Area Under the Curve) or just AUC (Area Under the ROC Curve), provides a quantitative way of describing the model performance. AUC is used to evaluate the quality of a model's predictive ability regardless of the selected threshold.
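As a minimal sketch, the ROC curve and its AUC can be computed in Scikit-learn from the predicted probabilities (the logistic regression model and synthetic data below are only for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic binary classification data and predicted positive-class probabilities
X, y = make_classification(n_samples=500, random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

fpr, tpr, thresholds = roc_curve(y, probs)   # one (FPR, TPR) pair per threshold
print("AUC =", roc_auc_score(y, probs))      # area under the ROC curve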

Logarithmic Loss

AUC-ROC quantifies the model's ability to discriminate between two different classes. However, it does not provide a way to update the predicted probabilities. By contrast, Logarithmic loss or log loss, as a probabilistic metric, measures the difference between the predicted probabilities and the actual class labels. Therefore, log loss is particularly useful for handling models that output probabilities, e.g., logistic regression or ANNs. The use of this function provides feedback for model evaluation during training and thus helps improve the predicted probabilities.
The following equation for the log loss, $\ell$, is presented here for binary classification tasks. A similar mathematical formulation was introduced in the subsection on cross-entropy.
\begin{equation*}
\ell=-\frac{1}{I} \sum_{i=1}^{I}\left[y_{i} \ln \left(P_{i}\right)+\left(1-y_{i}\right) \ln \left(1-P_{i}\right)\right] \tag{17.124}
\end{equation*}
where $y_{i}$ indicates whether sample $i$ belongs to class 1 (e.g., a value of 1 for "belongs to"), and $P_{i}$ is the predicted probability that sample $i$ belongs to class 1, i.e., "positive". Log loss has a range of $[0, \infty)$; that is, this metric is non-negative and has no upper bound. In general, a value near 0 indicates higher accuracy, whereas a high loss value indicates low accuracy.
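The following sketch evaluates Eq. (17.124) directly and compares it with Scikit-learn's log_loss (the labels and probabilities are hypothetical):

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])             # actual binary labels
p_pos = np.array([0.9, 0.2, 0.6, 0.8, 0.1])    # predicted P(class = 1)

# Direct implementation of Eq. (17.124)
manual = -np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))
print(manual, log_loss(y_true, p_pos))         # the two values agree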
Log loss is also suitable for multi-class classification problems. The above equation needs to be adjusted to allow for multiple classes. This will be introduced in the following subsection.

17.4.3. Classification: Multi-Class

Indirect Methods

Indirect methods extend the above evaluation metrics from binary classification to multi-class classification by decomposing one multi-class classification problem into multiple binary classification problems in a one-vs-all or a one-vs-one fashion. Taking one-vs-all as an example, one class of interest is viewed as positive, and all other classes are labeled as negative. In this way, binary classification metrics such as precision, recall, and the F1 score can be applied to each binary classification problem and then averaged to yield the metric value for the multi-class classification problem. Depending on the averaging approach, we have at least three indirect methods.
  • Macro Average: First calculate the metric value for each binary classification problem, and then take the arithmetic average. For example, a classification problem with $K$ classes can be decomposed into $K$ one-vs-all binary classification problems. Then, we can calculate the macro-averaged precision as
\begin{align*}
\text{Precision}_{k} &= \frac{TP_{k}}{TP_{k}+FP_{k}} \tag{17.125}\\
\text{Precision}_{\text{macro}} &= \frac{1}{K}\sum_{k=1}^{K}\text{Precision}_{k} \tag{17.126}
\end{align*}
  • Micro Average: Gather the results from all binary classification problems and then use these results to calculate the metric for the multi-class problem.
\begin{equation*}
\text{Precision}_{\text{micro}}=\frac{\sum_{k=1}^{K} TP_{k}}{\sum_{k=1}^{K} TP_{k}+\sum_{k=1}^{K} FP_{k}} \tag{17.127}
\end{equation*}
  • Weighted Average: Modify the macro average by assigning a different weight to the metric from each binary classification problem, so that possible data-imbalance issues can be taken into account.
\begin{equation*}
\text{Precision}_{\text{weighted}}=\frac{\sum_{k=1}^{K} w_{k} \cdot \text{Precision}_{k}}{\sum_{k=1}^{K} w_{k}} \tag{17.128}
\end{equation*}
where $w_{k}$ is the weight for the $k$th binary classification problem, typically the number (or fraction) of samples whose true label is class $k$.
The macro method treats all classes equally; as a result, classes with few samples may be given undue emphasis. By contrast, the micro method handles imbalanced classes better. In fact, the choice among these indirect methods may not matter much for tasks with balanced classes. The choice also depends on how much significance we attach to classes with fewer samples: if we want to emphasize them, the macro method can be the better choice, and when a more fine-grained treatment is needed, we can resort to the weighted average method.
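In Scikit-learn, these three averaging schemes are selected through the average argument of metric functions such as precision_score; a minimal sketch with hypothetical multi-class labels:

from sklearn.metrics import precision_score

# Hypothetical true and predicted labels for a 3-class problem
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0, 2]

for avg in ("macro", "micro", "weighted"):
    print(avg, precision_score(y_true, y_pred, average=avg))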

Confusion Matrix

The confusion matrix can be applied to multi-class classification without major changes. We just need to replace the Positive and Negative labels with the multi-class labels, such as $1, 2, 3, \ldots$ As a result, the core of the confusion matrix grows beyond $2 \times 2$; for $K$ classes it is $K \times K$. We can get a good understanding of the multi-class confusion matrix by running the following code, which generates a confusion matrix for a classification problem with 10 classes.
from sklearn import datasets  # Dataset
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay  # Metrics

# Loading the dataset: generate 500 samples from 10 classes
X, y_true = datasets.make_blobs(n_samples=500, centers=10, n_features=2, random_state=0)

# Fitting using the K-Nearest Neighbors classifier
knnc = KNeighborsClassifier(n_neighbors=2).fit(X, y_true)
y_pred = knnc.predict(X)

# Print and plot the evaluation results
print(confusion_matrix(y_true, y_pred))
ConfusionMatrixDisplay.from_predictions(y_true, y_pred, cmap='plasma')
The above code generates Fig. 17.5 for the confusion matrix.
The diagonal cells in light colors correspond to the correctly predicted samples, and the numbers in these cells are the corresponding sample counts. Off the diagonal, for example, the cell in the first column and second row indicates that four samples with a true label of 2 were predicted to be 0.
Figure 17.5: Confusion matrix for multi-class classification problem

Logarithmic Loss

The log loss can also be applied to multi-class classification. Based on the same cross-entropy concept, a general mathematical formula for the log loss in multi-class classification tasks is usually written as follows.
\begin{equation*}
\ell=-\frac{1}{I} \sum_{i=1}^{I} \sum_{k=1}^{K} y_{ik} \ln \left(P_{ik}\right) \tag{17.129}
\end{equation*}
where $y_{ik}$ indicates whether sample $i$ belongs to class $k$ (e.g., a value of 1 for "belongs to"), and $P_{ik}$ is the predicted probability that sample $i$ belongs to class $k$. Log loss has a range of $[0, \infty)$. This way of labeling the samples is the widely adopted "one-hot" encoding. Thanks to this labeling method, the use of the above equation is fairly simple: for each sample, only the term containing the $P_{ik}$ of the class it belongs to needs to be considered, while all other terms are zero.
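A minimal sketch of Eq. (17.129) with a hypothetical matrix of per-class probabilities (one row per sample, one column per class), compared against Scikit-learn's log_loss:

import numpy as np
from sklearn.metrics import log_loss

y_true = [0, 2, 1, 2]              # class labels of 4 samples from 3 classes
P = np.array([[0.8, 0.1, 0.1],     # predicted probabilities P_ik (rows sum to 1)
              [0.2, 0.2, 0.6],
              [0.1, 0.7, 0.2],
              [0.3, 0.3, 0.4]])

# With one-hot labels, only the probability of the true class enters the sum
manual = -np.mean(np.log(P[np.arange(len(y_true)), y_true]))
print(manual, log_loss(y_true, P, labels=[0, 1, 2]))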

Kappa Coefficient

The kappa coefficient, also known as Cohen's kappa coefficient, quantifies the agreement between two sets of multi-class labels. If one set of labels is the ground truth, then the other set can be the predictions from the classification task to be evaluated. The kappa coefficient has the following formula:
\begin{equation*}
\kappa=\frac{p-p_{e}}{1-p_{e}}=\frac{p-\sum_{k=1}^{K}\left(I_{k} \cdot \tilde{I}_{k}\right) / I^{2}}{1-\sum_{k=1}^{K}\left(I_{k} \cdot \tilde{I}_{k}\right) / I^{2}} \tag{17.130}
\end{equation*}
where $p$ is the observed agreement between the two label sets, i.e., the overall accuracy when one set is the ground truth, $p_{e}=\sum_{k=1}^{K}\left(I_{k} \cdot \tilde{I}_{k}\right) / I^{2}$ is the agreement expected by chance, $I$ is the total number of samples, $I_{k}$ is the number of samples whose true label is class $k$, and $\tilde{I}_{k}$ is the number of samples with a predicted label of $k$.
In theory, the kappa coefficient can vary from $-1$ to $1$, but in most applications the range is $[0,1]$. A value in $[0,0.2]$ implies very low similarity, $[0.2,0.4]$ low, $[0.4,0.6]$ medium, $[0.6,0.8]$ good, and $[0.8,1.0]$ excellent agreement.
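The kappa coefficient is available in Scikit-learn as cohen_kappa_score; a minimal sketch with hypothetical label sets:

from sklearn.metrics import cohen_kappa_score

y_true = [0, 1, 2, 2, 1, 0, 1, 2]   # hypothetical true labels
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]   # hypothetical predicted labels
print(cohen_kappa_score(y_true, y_pred))   # agreement corrected for chance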

Hinge Loss

The hinge loss was originally proposed for the use of SVM for solving binary classification problems. A general formulation for this purpose is as follows:
\begin{equation*}
\ell_{\text{hinge}}=\max \left(0,\, 1-y_{i} \cdot \tilde{y}_{i}\right) \tag{17.131}
\end{equation*}
where $y_{i}$ is the actual label of sample $i$, e.g., 1 or $-1$, and $\tilde{y}_{i}$ is the prediction. Note that $\tilde{y}_{i}$ here is not necessarily the predicted label; it can be the raw model output $f(\vec{x})$, where $f(\vec{x})$ may be a linear model ($f(\vec{x})=\vec{w}^{T} \cdot \vec{x}+b$) or an ANN. To use this loss as an evaluation metric, we can compute the total or average hinge loss over all samples.
The Hinge loss can be extended to multi-class classification in two different ways. The first way is to use the indirect methods by breaking down the multi-class classification problem into multiple binary classification problems in a one-vs-all or a one-vs-one fashion. The second way is to modify the formula directly. There are multiple variations for this purpose. The following is a popular one.
\begin{equation*}
\ell_{\text{hinge}}=\max \left(0,\, 1+\max _{j \neq k}\left(\tilde{y}_{ij}\right)-\tilde{y}_{ik}\right) \tag{17.132}
\end{equation*}
where sample $i$ has a true label of $k$, and $\tilde{y}_{ik}$ can be the probability or score predicted for the $k$th class.
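Scikit-learn provides hinge_loss, which expects the raw decision values $f(\vec{x})$ rather than hard labels and returns the average hinge loss; a minimal sketch with a linear SVM on synthetic binary data:

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.metrics import hinge_loss

# Synthetic binary data and a linear SVM
X, y = make_classification(n_samples=200, random_state=0)
svm = LinearSVC(max_iter=10000, random_state=0).fit(X, y)

decision = svm.decision_function(X)   # raw scores f(x), not hard labels
print(hinge_loss(y, decision))        # average hinge loss over all samples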

17.4.4. Classification: Multi-Label

Hamming Distance

Hamming distance is suitable for multi-label classification problems. It compares two classification results, which are usually the predicted results and the true labels. As a "distance" measure, it quantifies the difference or dissimilarity between the two classification results.
\begin{equation*}
D_{\text{hamming}}\left(\tilde{y}_{i}, y_{i}\right)=\frac{1}{L} \sum_{l=1}^{L} \mathbb{1}\left(\tilde{y}_{il} \neq y_{il}\right) \tag{17.133}
\end{equation*}
where the above Hamming distance is for sample $i$, which has $L$ labels, and $\mathbb{1}(\cdot)$ is the indicator function, equal to 1 when its argument is true and 0 otherwise.
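Scikit-learn's hamming_loss implements this distance averaged over samples; a minimal sketch with hypothetical multi-label indicator matrices (rows are samples, columns are the L labels):

import numpy as np
from sklearn.metrics import hamming_loss

# Each row is one sample; each column is one of L = 3 labels
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 1, 1],
                   [0, 1, 0]])
print(hamming_loss(y_true, y_pred))   # 1 wrong label out of 6 -> 0.1667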

Jaccard Similarity Coefficient

The Jaccard similarity coefficient measures similarity by comparing, for each sample, the label sets produced by two classification results, e.g., the predictions made by the model of interest and the true labels.
\begin{equation*}
J\left(\tilde{y}_{i}, y_{i}\right)=\frac{\left|\tilde{y}_{i} \cap y_{i}\right|}{\left|\tilde{y}_{i} \cup y_{i}\right|} \tag{17.134}
\end{equation*}
where $\left|\tilde{y}_{i} \cap y_{i}\right|$ and $\left|\tilde{y}_{i} \cup y_{i}\right|$ are the numbers of labels that appear in both sets and in at least one of the two sets (their union), respectively.
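Scikit-learn's jaccard_score computes Eq. (17.134); with average='samples' it returns the mean coefficient over samples in a multi-label setting (the indicator matrices below are hypothetical):

import numpy as np
from sklearn.metrics import jaccard_score

y_true = np.array([[1, 0, 1],
                   [0, 1, 1]])
y_pred = np.array([[1, 1, 1],
                   [0, 1, 0]])
# Per-sample Jaccard: 2/3 for the first sample and 1/2 for the second
print(jaccard_score(y_true, y_pred, average="samples"))   # (2/3 + 1/2) / 2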

17.4.5. Regression

Root Mean Squared Error

Root Mean Squared Error (RMSE) is among the most popular metrics, if not the most popular one, for regression problems. RMSE is defined by the standard deviation of the prediction errors. These prediction errors, which are also called residuals in some places, measure the distance of data points from the regression line.
\begin{equation*}
RMSE=\sqrt{\frac{1}{I} \sum_{i=1}^{I}\left(y_{i}-\tilde{y}_{i}\right)^{2}} \tag{17.135}
\end{equation*}
where $\left(y_{i}-\tilde{y}_{i}\right)^{2}$ is the square of the difference between the actual and predicted values for sample $i$, and $I$ is the number of samples.
RMSE tells us how well the data points are clustered around the regression line. RMSE handles a large number of data points efficiently, leading to a more reliable estimate of the error. However, note that RMSE is heavily influenced by outliers (data points that differ significantly from the others), so excluding outliers can be critical before using RMSE. RMSE can be obtained easily with many machine learning packages. Taking Scikit-learn as an example, the following code computes RMSE from the true and predicted labels.
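A minimal sketch, assuming Scikit-learn 1.4 or newer (where root_mean_squared_error is available; older releases can use mean_squared_error with squared=False):

from sklearn.metrics import root_mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]   # hypothetical actual values
y_pred = [2.5, 0.0, 2.0, 8.0]    # hypothetical predictions
print(root_mean_squared_error(y_true, y_pred))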
Most of the regression metrics to be introduced next can be obtained in a similar way.

Mean Absolute Error

Mean Absolute Error (MAE) is the average of the absolute differences between the predicted and actual labels. MAE thus measures the average magnitude of the errors, i.e., how far the predictions are from the actual values. However, MAE does not indicate the direction of the errors, i.e., whether the model tends to over-predict or under-predict.
\begin{equation*}
MAE=\frac{1}{I} \sum_{i=1}^{I}\left|y_{i}-\tilde{y}_{i}\right| \tag{17.136}
\end{equation*}

Mean Squared Error

Mean Squared Error (MSE) is defined as the average of the squares of the differences between the predicted and actual values.
\begin{equation*}
MSE=\frac{1}{I} \sum_{i=1}^{I}\left(y_{i}-\tilde{y}_{i}\right)^{2} \tag{17.137}
\end{equation*}
MSE, together with MAE and other metrics, is commonly employed to construct the loss function in the optimization/solution of machine learning problems. In such optimization applications, e.g., loss reduction via backpropagation, the gradient of MSE is easier to compute than that of MAE, whose absolute-value term is not differentiable at zero.
Regarding sensitivity to outliers, the order is MSE $>$ RMSE $>$ MAE: RMSE is more sensitive to outliers than MAE but less sensitive than MSE. Therefore, RMSE can be considered when a compromise between MAE and MSE is desired. By contrast, MAE is preferred if we care more about small errors than large ones or if the data contains many outliers, while MSE stands out when we want to focus on large errors or outliers.
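A brief sketch (with hypothetical data containing one outlying prediction) illustrates this ordering of outlier sensitivity:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 4.1, 15.0])   # the last prediction is an outlier

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(mae, rmse, mse)   # the outlier inflates MSE the most and MAE the least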

Root Mean Squared Logarithmic Error

The Root Mean Squared Logarithmic Error (RMSLE) applies the logarithm to the predicted and actual values before computing the error. RMSLE can be considered if we do not want to heavily penalize large absolute differences between the predicted and actual values, since the logarithm effectively measures relative rather than absolute errors.
\begin{equation*}
RMSLE=\sqrt{\frac{1}{I} \sum_{i=1}^{I}\left[\ln \left(y_{i}+1\right)-\ln \left(\tilde{y}_{i}+1\right)\right]^{2}} \tag{17.138}
\end{equation*}
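A minimal sketch of Eq. (17.138) via Scikit-learn's mean_squared_log_error, which computes the mean squared log error; taking its square root yields RMSLE (the values below are hypothetical):

import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = [3.0, 5.0, 2.5, 7.0]   # hypothetical actual values (non-negative)
y_pred = [2.5, 5.0, 4.0, 8.0]   # hypothetical predictions
print(np.sqrt(mean_squared_log_error(y_true, y_pred)))   # RMSLE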

$R^2$ and Adjusted $R^2$

$R^{2}$, also known as the coefficient of determination, is a statistical measure of how closely the data points fit the regression line. $R^{2}$ values always lie between $0\%$ and $100\%$. An $R^{2}$ value of $0\%$ indicates that the model explains none of the relationship between the input and the output (e.g., a constant-only regression), while $100\%$ means that all of the data points lie exactly on the regression line. Thus, a higher $R^{2}$ value corresponds to a better model.
\begin{equation*}
R^{2}=1-\frac{MSE}{\operatorname{Var}}=1-\frac{\frac{1}{I} \sum_{i=1}^{I}\left(y_{i}-\tilde{y}_{i}\right)^{2}}{\frac{1}{I} \sum_{i=1}^{I}\left(y_{i}-\bar{y}\right)^{2}} \tag{17.139}
\end{equation*}
However, $R^{2}$ cannot tell whether adding more independent variables (attributes) genuinely improves the model, because it never decreases when variables are added. The adjusted $R^{2}$, or $R_{a}^{2}$, was proposed to address this issue: it reflects the percentage of the variance of the dependent variable (label) explained by the independent variables while penalizing the number of variables used.
\begin{equation*}
R_{a}^{2}=1-\left(1-R^{2}\right)\left[\frac{I-1}{I-(J+1)}\right] \tag{17.140}
\end{equation*}
where $J$ is the number of independent variables.
The adjusted $R_{a}^{2}$ value increases only if newly added variables/attributes improve the model. In other words, $R_{a}^{2}$ decreases if we add useless/unnecessary/irrelevant variables to a model and increases if the newly added variables are useful.
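A minimal sketch computing $R^{2}$ with Scikit-learn's r2_score and then applying Eq. (17.140) for the adjusted value (the synthetic regression below is only for illustration):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression data: I = 100 samples, J = 3 independent variables
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)
y_pred = LinearRegression().fit(X, y).predict(X)

r2 = r2_score(y, y_pred)
I, J = X.shape
adj_r2 = 1 - (1 - r2) * (I - 1) / (I - (J + 1))   # Eq. (17.140)
print(r2, adj_r2)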

  1. sklearn.metrics.root_mean_squared_error(y_true, y_pred)
